extraction attack
Extracting Training Data from Molecular Pre-trained Models
This work, for the first time, explores the risks of extracting private training molecular data from molecular pre-trained models. This task is nontrivial as the molecular pre-trained models are non-generative and exhibit a diversity of model architectures, which differs significantly from language and image models.
RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation
Li, Qinfeng, Pan, Miao, Xiong, Ke, Su, Ge, Shen, Zhiqiang, Liu, Yan, Sun, Bing, Peng, Hao, Zhang, Xuhong
Retrieval-Augmented Generation (RAG) systems deployed over proprietary knowledge bases face growing threats from reconstruction attacks that aggregate model responses to replicate knowledge bases. Such attacks exploit both intra-class and inter-class paths, progressively extracting fine-grained knowledge within topics and diffusing it across semantically related ones, thereby enabling comprehensive extraction of the original knowledge base. However, existing defenses target only one path, leaving the other unprotected. We conduct a systematic exploration to assess the impact of protecting each path independently and find that joint protection is essential for effective defense. Based on this, we propose RAGFort, a structure-aware dual-module defense combining "contrastive reindexing" for inter-class isolation and "constrained cascade generation" for intra-class protection. Experiments across security, performance, and robustness confirm that RAGFort significantly reduces reconstruction success while preserving answer quality, offering comprehensive defense against knowledge base extraction attacks.
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries
Wang, Yuhao, Qu, Wenjie, Zhai, Shengfang, Jiang, Yanze, Liu, Zichen, Liu, Yue, Dong, Yinpeng, Zhang, Jiaheng
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, but this may expose them to extraction attacks, leading to potential copyright and privacy risks. However, existing extraction methods typically rely on malicious inputs such as prompt injection or jailbreaking, making them easily detectable via input- or output-level detection. In this paper, we introduce Implicit Knowledge Extraction Attack (IKEA), which conducts knowledge extraction on RAG systems through benign queries. Specifically, IKEA first leverages anchor concepts (keywords related to internal knowledge) to generate queries with a natural appearance, and then designs two mechanisms that lead anchor concepts to thoroughly "explore" the RAG's knowledge: (1) Experience Reflection Sampling, which samples anchor concepts based on past query-response histories, ensuring their relevance to the topic; (2) Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints to further exploit the embedding space. Extensive experiments demonstrate IKEA's effectiveness under various defenses, surpassing baselines by over 80% in extraction efficiency and 90% in attack success rate. Moreover, the substitute RAG system built from IKEA's extractions shows comparable performance to the original RAG and outperforms those based on baselines across multiple evaluation tasks, underscoring the stealthy copyright infringement risk in RAG systems.
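The idea of mutating an anchor concept "under similarity constraints" can be illustrated with a small sketch. This is not IKEA's published algorithm: the function names, the cosine-similarity threshold `min_sim`, and the rule of preferring the candidate least similar to already-explored anchors are all assumptions made for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mutate_anchor(current, candidates, explored, min_sim=0.5):
    """Return the index of the next anchor embedding, or None.

    Hypothetical rule: a candidate is admissible only if its cosine
    similarity to the current anchor is at least `min_sim` (the "trust
    region"). Among admissible candidates, pick the one whose highest
    similarity to any already-explored anchor is lowest.
    """
    best_idx, best_score = None, None
    for i, cand in enumerate(candidates):
        if cosine(current, cand) < min_sim:
            continue  # outside the trust region
        # Highest similarity to anything already explored (lower is more novel).
        score = max((cosine(cand, e) for e in explored), default=-1.0)
        if best_score is None or score < best_score:
            best_idx, best_score = i, score
    return best_idx

current = np.array([1.0, 0.0])
candidates = [np.array([1.0, 0.1]), np.array([0.7, 0.7]), np.array([0.0, 1.0])]
explored = [np.array([1.0, 0.0])]
next_idx = mutate_anchor(current, candidates, explored, min_sim=0.5)
```

In this toy setup the third candidate is orthogonal to the current anchor and is rejected, while the second is both admissible and the least similar to what has been explored.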
Train to Defend: First Defense Against Cryptanalytic Neural Network Parameter Extraction Attacks
Neural networks are valuable intellectual property due to the significant computational cost, expert labor, and proprietary data involved in their development. Consequently, protecting their parameters is critical not only for maintaining a competitive advantage but also for enhancing the model's security and privacy. Prior works have demonstrated the growing capability of cryptanalytic attacks to scale to deeper models. In this paper, we present the first defense mechanism against cryptanalytic parameter extraction attacks. Our key insight is to eliminate the neuron uniqueness necessary for these attacks to succeed. We achieve this through a novel, extraction-aware training method. Specifically, we augment the standard loss function with an additional regularization term that minimizes the distance between neuron weights within a layer. Because the change is confined to training, the proposed defense has zero area-delay overhead during inference. We evaluate the effectiveness of our approach in mitigating extraction attacks while analyzing the model accuracy across different architectures and datasets. When re-trained with the same model architecture, the results show that our defense incurs a marginal accuracy change of less than 1% with the modified loss function. Moreover, we present a theoretical framework to quantify the success probability of the attack. When tested comprehensively with prior attack settings, our defense demonstrated empirical success for sustained periods of extraction, whereas unprotected networks are extracted in between 14 minutes and 4 hours.
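A regularization term that "minimizes the distance between neuron weights within a layer" could take many forms; the abstract does not give the exact formula. The sketch below assumes one plausible choice, the mean squared Euclidean distance over all pairs of neurons in a layer, weighted by a hypothetical coefficient `lam`. Function names are illustrative.

```python
import numpy as np

def neuron_distance_penalty(W):
    """Mean squared Euclidean distance over all ordered pairs of distinct
    neurons, where each row of W is one neuron's weight vector."""
    n = W.shape[0]
    if n < 2:
        return 0.0
    diffs = W[:, None, :] - W[None, :, :]      # shape (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)     # shape (n, n), zero diagonal
    return float(sq_dists.sum() / (n * (n - 1)))

def regularized_loss(task_loss, layer_weights, lam=0.01):
    """Task loss plus lam times the summed per-layer penalties (assumed form)."""
    penalty = sum(neuron_distance_penalty(W) for W in layer_weights)
    return task_loss + lam * penalty

identical = np.array([[1.0, 0.0], [1.0, 0.0]])   # two identical neurons
distinct = np.array([[0.0, 0.0], [3.0, 4.0]])    # neurons 5 units apart
loss = regularized_loss(1.0, [distinct], lam=0.1)
```

The penalty is zero when all neurons in a layer share the same weights and grows as they diverge, so it acts only on the training objective and leaves the inference-time computation unchanged.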
A Survey on Model Extraction Attacks and Defenses for Large Language Models
Zhao, Kaixiang, Li, Lincan, Ding, Kaize, Gong, Neil Zhenqiang, Zhao, Yue, Dong, Yushun
Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
Wang, Kun, Zhang, Guibin, Zhou, Zhenhong, Wu, Jiahao, Yu, Miao, Zhao, Shiqian, Yin, Chenlong, Fu, Jinhu, Yan, Yibo, Luo, Hanjun, Lin, Liang, Xu, Zhihao, Lu, Haolang, Cao, Xinye, Zhou, Xinyun, Jin, Weifei, Meng, Fanci, Xu, Shicheng, Mao, Junyuan, Wang, Yu, Wu, Hao, Wang, Minghe, Zhang, Fan, Fang, Junfeng, Qu, Wenjie, Liu, Yue, Liu, Chengwei, Zhang, Yifan, Li, Qiankun, Guo, Chongye, Qin, Yalan, Fan, Zhaoxin, Wang, Kai, Ding, Yi, Hong, Donghai, Ji, Jiaming, Lai, Yingxin, Yu, Zitong, Li, Xinfeng, Jiang, Yifan, Li, Yanhui, Deng, Xinyu, Wu, Junlin, Wang, Dongxia, Huang, Yihao, Guo, Yufei, Huang, Jen-tse, Wang, Qiufeng, Jin, Xiaolong, Wang, Wenxuan, Liu, Dongrui, Yue, Yanwei, Huang, Wenke, Wan, Guancheng, Chang, Heng, Li, Tianlin, Yu, Yi, Li, Chenghao, Li, Jiawei, Bai, Lei, Zhang, Jie, Guo, Qing, Wang, Jingyi, Chen, Tianlong, Zhou, Joey Tianyi, Jia, Xiaojun, Sun, Weisong, Wu, Cong, Chen, Jing, Hu, Xuming, Li, Yiming, Wang, Xiao, Zhang, Ningyu, Tuan, Luu Anh, Xu, Guowen, Zhang, Jiaheng, Zhang, Tianwei, Ma, Xingjun, Gu, Jindong, Pang, Liang, Wang, Xiang, An, Bo, Sun, Jun, Bansal, Mohit, Pan, Shirui, Lyu, Lingjuan, Elovici, Yuval, Kailkhura, Bhavya, Yang, Yaodong, Li, Hongwei, Xu, Wenyuan, Sun, Yizhou, Wang, Wei, Li, Qing, Tang, Ke, Jiang, Yu-Gang, Juefei-Xu, Felix, Xiong, Hui, Wang, Xiaofeng, Tao, Dacheng, Yu, Philip S., Wen, Qingsong, Liu, Yang
The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., the deployment phase or fine-tuning phase, and lack a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full-stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of more than 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.
Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Hu, Yingqi, Zhang, Zhuo, Zhang, Jingyuan, Qu, Lizhen, Xu, Zenglin
Federated fine-tuning of large language models (FedLLMs) presents a promising approach for achieving strong model performance while preserving data privacy in sensitive domains. However, the inherent memorization ability of LLMs makes them vulnerable to training data extraction attacks. To investigate this risk, we introduce simple yet effective extraction attack algorithms specifically designed for FedLLMs. In contrast to prior "verbatim" extraction attacks, which assume access to fragments from all training data, our approach operates under a more realistic threat model, where the attacker only has access to a single client's data and aims to extract previously unseen personally identifiable information (PII) from other clients. This requires leveraging contextual prefixes held by the attacker to generalize across clients. To evaluate the effectiveness of our approaches, we propose two rigorous metrics, coverage rate and efficiency, and extend a real-world legal dataset with PII annotations aligned with CPIS, GDPR, and CCPA standards, achieving 89.9% human-verified precision. Experimental results show that our method can extract up to 56.57% of victim-exclusive PII, with "Address," "Birthday," and "Name" being the most vulnerable categories. Our findings underscore the pressing need for robust defense strategies and contribute a new benchmark and evaluation framework for future research in privacy-preserving federated learning.
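The abstract names coverage rate and efficiency as metrics but does not define them. The sketch below assumes simple definitions for illustration only: coverage rate as the fraction of victim-exclusive PII items that appear among the extracted strings, and efficiency as the number of such distinct hits per query issued. The function names and the sample values are hypothetical.

```python
def coverage_rate(extracted, victim_pii):
    """Fraction of victim-exclusive PII items recovered (assumed definition)."""
    victim = set(victim_pii)
    if not victim:
        return 0.0
    return len(set(extracted) & victim) / len(victim)

def efficiency(extracted, victim_pii, num_queries):
    """Distinct victim-exclusive PII items recovered per query (assumed definition)."""
    if num_queries <= 0:
        return 0.0
    return len(set(extracted) & set(victim_pii)) / num_queries

# Made-up example: four victim items, two of them recovered in ten queries.
victim_items = ["addr-1", "addr-2", "name-1", "dob-1"]
extracted_items = ["addr-1", "name-1", "name-1", "unrelated"]
cov = coverage_rate(extracted_items, victim_items)
eff = efficiency(extracted_items, victim_items, num_queries=10)
```

Using sets means repeated extractions of the same item count once, so neither metric rewards an attack for regenerating the same string.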